SQL to select random mix of rows fairly [migrated]

Posted by Matt Sieker on Programmers See other posts from Programmers or by Matt Sieker
Published on 2012-06-20T21:25:19Z Indexed on 2012/06/21 3:25 UTC
Read the original article Hit count: 253

Filed under:
|

Here's my problem: I have a set of tables in a database populated with data from a client that contains product information. In addition to the basic product information, there is also information about the manufacturer, and categories for those products (a product can be in one or more categories). These categories are then referred to as "Product Categories", and which stores these products are available at. These tables are updated once a week from a feed from the customer.

Since for our purposes, some of the product categories are the same, or closely related for our purposes, there is another level of categories called "General Categories", a general category can have one or more product categories.

For the scope of these tables, here's some rough numbers:

Data Tables:
Products:           475,000
Manufacturers:      1300
Stores:             150
General Categories: 245
Product Categories: 500

Mapping Tables:
Product Category -> Product: 655,000
Stores -> Products:          50,000,000

Now, for the actual problem: As part of our software, we need to select n random products, given a store and a general category. However, we also need to ensure a good mix of manufacturers, as in some categories, a single manufacturer dominates the results, and selecting rows at random causes the results to strongly favor that manufacturer.

The solution that is currently in place, works for most cases, involves selecting all of the rows that match the store and category criteria, partition them on manufacturer, and include their row number from within their partition, then select from that where the row number for that manufacturer is less than n, and use ROWCOUNT to clamp the total rows returned to n.

This query looks something like this:

SET ROWCOUNT 6
select p.Id, GeneralCategory_Id, Product_Id, ISNULL(m.DisplayName, m.Name) AS Vendor, 
       MSRP, MemberPrice, FamilyImageName 
    from (select p.Id, gc.Id GeneralCategory_Id, 
            p.Id Product_Id, ctp.Store_id, Manufacturer_id, 
            ROW_NUMBER() OVER (PARTITION BY Manufacturer_id ORDER BY NEWID()) AS 'VendorOrder', 
            MSRP, MemberPrice, FamilyImageName
            from GeneralCategory gc
                inner join GeneralCategoriesToProductCategories gctpc ON gc.Id=gctpc.GeneralCategory_Id
                inner join ProductCategoryToProduct pctp on gctpc.ProductCategory_Id = pctp.ProductCategory_Id
                inner join Product p on p.Id = pctp.Product_Id
                inner join StoreToProduct ctp on p.Id = ctp.Product_id
                where gc.Id = @GeneralCategory and ctp.Store_id=@StoreId and p.Active=1 and p.MemberPrice >0) p 
    inner join Manufacturer m on m.Id = p.Manufacturer_id
    where VendorOrder <=6
order by NEWID()

SET ROWCOUNT 0

(I've tried to somewhat format it to make it cleaner, but I don't think it really helps)

Running this query with an execution plan shows that for the majority of these tables, it's doing a Clustered Index Seek. There are two operations that take up roughly 90% of the time:

  1. Index Seek (Nonclustered) on StoreToProduct: 17%. This table just contains the key of the store, and the key of the product. It seems that NHibernate decided not to make a composite key when making this table, but I'm not concerned about this at this point, as compared to the other seek...
  2. Clustered Index Seek on Product: 69%. I really have no clue how I could make this one more performant.

On categories without a lot of products, performance is acceptable (<50ms), however larger categories can take a few hundred ms, with the largest category taking 3s (which has about 170k products).

It seems I have two ways to go from this point:

  1. Somehow optimize the existing query and table indices to lower the query time. As almost every expensive operation is already a clustered index scan, I don't know what could be done there. The inner query could be tuned to not return all of the possible rows for that category, but I am unsure how to do this, and maintain the requirements (random products, with a good mix of manufacturers)
  2. Denormalize this data for the purpose of this query when doing the once a week import. However, I am unsure how to do this and maintain the requirements.

Does anyone have any input on either of these items?

© Programmers or respective owner

Related posts about sql

Related posts about sql-server